Image metrics in the statistical analysis of DNA microarray data

ABSTRACT

Expression profiling using DNA microarrays is an important new method for analyzing cellular physiology. In “spotted” microarrays, fluorescently labeled cDNA from experimental and control cells is hybridized to arrayed target DNA and the arrays imaged at two or more wavelengths. Statistical analysis is performed on microarray images and show that non-additive background, high intensity fluctuations across spots, and fabrication artifacts interfere with the accurate determination of intensity information. The probability density distributions generated by pixel-by-pixel analysis of images can be used to measure the precision with which spot intensities are determined. Simple weighting schemes based on these probability distributions are effective in improving significantly the quality of microarray data as it accumulates in a multi-experiment database. Error estimates from image-based metrics should be one component in an explicitly probabilistic scheme for the analysis of DNA microarray data.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. application Ser.No. 09/770,833, filed Jan. 25, 2001, entitled “IMAGE METRICS IN THESTATISTICAL ANALYSIS OF DNA MICROARRAY DATA,” the disclosure of which ishereby incorporated herein by reference in its entirety.

TECHNICAL FIELD

Aspects of the present invention relate to DNA microarray analysis, andmore particularly to systems and methods using error estimates fromimage based metrics to analyze microarrays.

BACKGROUND

Large-scale expression profiling has emerged as a leading technology inthe systematic analysis of cell physiology. Expression profilinginvolves the hybridization of fluorescently labeled cDNA, prepared fromcellular mRNA, to microarrays carrying up to 10⁵ unique sequences.Several types of microarrays have been developed, but microarraysprinted using pin transfer are among the most popular. Typically, a setof target DNA samples representing different genes are prepared by PCRand transferred to a coated slide to form a 2-D array of spots with acenter-to-center distance (pitch) of about 200 μm. In the budding yeastS. cerevisiae, for example, an array carrying about 6200 genes providesa pan-genomic profile in an area of 3 cm² or less. mRNA samples fromexperimental and control cells are copied into cDNA and labeled usingdifferent color fluors (the control is typically called green and theexperiment red). Pools of labeled cDNAs are hybridized simultaneously tothe microarray, and relative levels of mRNA for each gene determined bycomparing red and green signal intensities. An elegant feature of thisprocedure is its ability to measure relative mRNA levels for many genesat once using relatively simple technology.

Computation is required to extract meaningful information from the largeamounts of data generated by expression profiling. The development ofbioinformatics tools and their application to the analysis of cellularpathways are topics of great interest. Several databases oftranscriptional profiles are accessible on-line and proposals arepending for the development of large public repositories. However,relatively little attention has been paid to the computation required toobtain accurate intensity information from microarrays. The issue isimportant however, because microarray signals are weak and biologicallyinteresting results are usually obtained through the analysis ofoutliers. Pixel-by-pixel information present in microarray images can beused in the formulation of metrics that assess the accuracy with whichan array has been sampled. Because measurement errors can be high inmicroarrays, a statistical analysis of errors combined withwell-established filtering algorithms are needed to improve thereliability of databases containing information from multiple expressionexperiments.

The foregoing and other features and advantages of the invention willbecome more apparent upon reading the following detailed description andupon reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

FIG. 1 is a gene expression curve according to one embodiment of thepresent invention.

FIG. 2 is a graph illustrating the distribution of ratios calculated foreach pixel of several spots according to one embodiment of the presentinvention.

FIG. 3 is a graph illustrating the spot intensity standard deviationplotted against the average signal.

FIG. 4 is a graph plotting covariance as a function of varianceaccording to one embodiment of the invention.

DETAILED DESCRIPTION

The present invention involves the process of extracting quantitativelyaccurate ratios from pairs of images and the process of determining theconfidence at which the ratios were properly obtained. One commonapplication of this methodology is analyzing cDNA Expression Arrays(microarrays). In these experiments, different cDNA's are arrayed onto asubstrate and that array of probes is used to test biological samplesfor the presence of specific species of mRNA messages throughhybridization. In the most common implementation, both an experimentalsample and a control sample are hybridized simultaneously onto the sameprobe array. In this way, the biochemical process is controlled forthroughout the experiment. The ratio of the experimental hybridizationto the control hybridization becomes a strong predictor of induction orrepression of gene expression within the biological sample.

The gene expression ratio model is essentially, $\begin{matrix}{{Expression\_ Ratio} = \frac{Experiment\_ Expression}{{Wild} - {type\_ Expression}}} & {{Equation}\quad 1}\end{matrix}$

In typical fluorescent microarray experiments, levels of expression aremeasured from the fluorescence intensity of fluorescently labeledexperiment and wild-type DNA. A number of assumptions are made about thefluorescence intensity, including: 1) the amount of DNA bound to a givenspot is proportional to the expression level of the given gene; 2) thefluorescence intensity is proportional to the concentration offluorescent molecules; and 3) the detection system responds linearly tofluorescence.

By convention, the fluorescent intensity of the experiment is called“red” and the wild-type is called “green.” The simplest form of the geneexpression ratio is $\begin{matrix}{R = \frac{r}{g}} & {{Equation}\quad 2}\end{matrix}$

-   -   where r and g represent the number of experiment and control DNA        molecules that bind to the spot. R is the expression ratio of        the experiment and control.

In real situations, however, r and g are unavailable. The measuredvalues, r_(m) and g_(m), include an unknown amount of backgroundintensity that consists of background fluorescence, excitation leak, anddetector bias. That is,r _(m) =r+r _(b)  Equation 3g _(m) =g+g _(b)  Equation 4

-   -   where r_(b) and g_(b) are unknown amounts of background        intensity in the red and green channels, respectively.

Including background values, the gene expression ratio becomes:$\begin{matrix}{R = \frac{r_{m} - r_{b}}{g_{m} - g_{b}}} & {{Equation}\quad 5}\end{matrix}$

Equation 5 shows that solving the correct ratio R requires knowledge ofthe background intensity for each channel. The importance of determiningthe correct background values is especially significant when r_(m) andg_(m) are only slightly above than r_(b) and g_(b). For example, thegraph 100 in FIG. 1 shows the effect of a ten count error in determiningr_(b) or g_(b), in the case where R is known to be one. The expectedresult is shown as line 105. A ten count error in the denominator isshown as line 110. A ten count error in the numerator is shown as 115.

In experimental situations, the sensitivity of the gene expression ratiotechnique can be limited by background subtraction errors, rather thanthe sensitivity of the detection system. Accurate determination of r_(b)and g_(b) is thus a key part of measuring the ratio of weakly expressedgenes.

Rearranging equation 5, givesr _(m) =R(g _(m) −g _(b))+r _(b) =Rg _(m) +k  Equation 6

Wherek=r _(b) −Rg _(b.)  Equation 7

Least squares curve-fit of equation 6 can be used to obtain the best-fitvalues of R and k, assuming that r_(b) and g_(b) are constant for allspot intensities involved with the curve-fit. The validity of thisassumption depends upon the chemistry of the microarray. Otherbackground intensity subtraction techniques, however, can have moresevere limitations. For example, the local background intensity is oftena poor estimate of a spot's background intensity.

Two approaches have been taken to the selection of spots involved withthe background curve-fit. Since most microarray experiments containthousands of spots, of which only a very small percent are affected bythe experiment, it is probably best to curve-fit all spots in themicroarray to equation 6. A refinement of this method is to use allspots that have no process control defects. Another alternative is toinclude ratio control spots within the array and use only those forcurve fittings. The former two are preferred, because curve fittingeither the entire array or at least much of it yields a strongstatistical measurement of the background values. In the case where theexperiment affects a large fraction of spots, however, it may benecessary to use ratio control spots.

Constant k is interesting because it consists of a linear combination ofall three desired values. While it is not possible to determine uniquevalues of rb and gb from the curve-fit, there are types of constraintsthat can be used to select useful values. First, the background valuesmust be greater than the bias level of the detection system and lessthan the minimum values of the measured data. That is,

Constraint 1:r _(bias) ≦r _(b) ≦r _(m) _(min)   Equation 8g _(bias) ≦g _(b) ≦g _(m) _(min)   Equation 9

The second type of constraint is based on the gene expression model. Forgenes that are unaffected by the experiment and are near zeroexpression, both the experiment and the control expression level shouldreach zero simultaneously. In mathematical terms, when r→0, then g→0. Alinear regression of (r_(m)-r_(b)) versus (g_(m)-g_(b)) should thenyield a zero intercept. That is, selection of appropriate r_(b), g_(b)should yield linear regression of(r _(m) −r _(b))=m(g _(m) −g _(b))+b  Equation 10

-   -   such that b is approximately zero. This occurs when        b=mg _(b) −r _(b)  Equation 11

The pair of values r_(b) and g_(b) that create a zero intercept of thelinear regression is thus the second constraint that can be used forextracting the background subtraction constants. Solving equations 7 and11 for g_(b) gives

Constraint 2: $\begin{matrix}{g_{b} = \frac{k + b}{m - R}} & {{Equation}\quad 12}\end{matrix}$

-   -   where R and k come from the best-fit of 6, and m and b come from        linear regression of 10. The background level r_(b) can then be        calculated by inserting g_(b) into equation 7 or 11.

Although it is almost always possible to generate a curve-fit of themicroarray spot intensities, it is not always possible to satisfyconstraints 1) and 2), especially at the same time. Failure to satisfyconstraint 1) is an indication that the experiment does not fit theexpected ratio model or that one of the linearity assumptions is untrue.

A somewhat trivial explanation of a failure to satisfy the constraintsis that the spot intensities have been incorrectly determined. A commonway that this happens is that the spot locations are incorrectlydetermined during the course of analysis.

Under ideal circumstances, one would also expect that the linearregression slope, m, should equal the best-fit ratio R. This can also beused as a measure of success. At the same time, the linear regressionintercept b should equal zero when the r_(b) and g_(b) meet constraint1).

Ratio Distribution Statistics

The measured values of the numerator and denominator are randomvariables with mean and variance. That is,r _(m) −r _(b) ={overscore (r)}±r _(SD)g _(m) −g _(b) ={overscore (g)}±g _(SD)  Equation 13

-   -   where {overscore (r)} and {overscore (g)} are mean values and        r_(SD) and g_(SD) are standard deviations of r and g. The ratio        R of r and g is then a random variable too with an expected        value R_(E) and variance R_(SD). That is, $\begin{matrix}        {R = {{R_{ɛ} \pm R_{SD}} = \frac{\overset{\_}{r} \pm r_{SD}}{\overset{\_}{g} \pm g_{SD}}}} & {{Equation}\quad 14}        \end{matrix}$

Assuming that the measurement of numerator and denominator are normallydistributed variables, an estimate of R_(E) and R_(SD) can be formedfrom Taylor series expansion. $\begin{matrix}{R_{E} \approx {\frac{\overset{\_}{r}}{\overset{\_}{g}} + {g_{SD}^{2}\frac{\overset{\_}{r}}{{\overset{\_}{g}}^{3}}} - \frac{\sigma_{rg}}{{\overset{\_}{g}}^{2}}}} & {{Equation}\quad 15} \\{R_{SD} \approx \sqrt{{g_{SD}^{2}\frac{{\overset{\_}{r}}^{2}}{{\overset{\_}{g}}^{4}}} + \frac{r_{SD}^{2}}{{\overset{\_}{g}}^{2}} - {2\sigma_{rg}\frac{\overset{\_}{r}}{{\overset{\_}{g}}^{3}}}}} & {{Equation}\quad 16}\end{matrix}$

-   -   where σ_(rg) is the covariance of the numerator and denominator        summed over all image pixels in the spot. $\begin{matrix}        {\sigma_{rg} = {\frac{1}{n}{\sum\limits_{i = 1}^{n}{\left( {r_{i} - \overset{\_}{r}} \right)\left( {g_{i} - \overset{\_}{g}} \right)}}}} & {{Equation}\quad 17}        \end{matrix}$

Coefficient of Variation

Assessing the equality of microarray scans and individual spots withinan array is an important part of scanning and analyzing arrays. A usefulmetric for this purpose is the coefficient of variation (CV) of theratio distribution, which is simply $\begin{matrix}{{CV} = \frac{R_{SD}}{R}} & {{Equation}\quad 18}\end{matrix}$

In effect, the CV represents the experimental resolution of the geneexpression ratio. Minimizing the CV should be the goal of scanning andanalyzing gene expression ratio experiments. Minimization of R_(SD) isthe best way to improve gene expression resolution. The graph 200 inFIG. 2 gives two examples of comparisons between scanned images. Theratios of a first spot are shown in line 205, while the ratios of asecond spot are shown as line 210. The first spot has a CV of 0.21,while the second spot has a CV of 0.60. The distribution of the ratiosin the first spot 205 have a narrower distribution than the ratios ofthe second spot 210. Thus, the first spot is a better spot to analyze.

Equation 16 shows that the variability of the ratio decreasesdramatically as a function of g, which is a well understood phenomenon.Dividing by a noisy measurement that is near zero produces a very noisyresult.

The ratio variance has an interesting dependence on the covarianceσ_(rg). Large values of σ_(rg) reduce the variability of the ratio. Thisdependence on the covariance is not widely known. In the case ofmicroarray images, strong covariance of the numerator and denominator isa result of three properties of the image data: good alignment of thenumerator and denominator images, genuine patterns and textures in thespot images, and a good signal-to-noise ratio (r/r_(SD) and g/g_(SD)).

Table 1 summarizes how variables combine to reduce RSD. TABLE 1 A shortsummary of the direction that variables need to move in order to reducethe ratio distribution variance. Variable R_(SD) Reduction g ↑ r ↓g_(SD) ↓ r_(SD) ↓ σ_(rg) ↑Spot CV

The CV is a fundamental metric and represents the spread of the ratiodistribution relative to the magnitude of the ratio. FIG. 3 shows thateven though the second spot has a higher ratio than the first spot, theCV is 3× higher. The uncertainty of the second spot's ratio is fargreater than the first spot's ratio. Thus, in addition to being usefulfor separating spots from the control population, the CV can also serveas an independent measure of a spot's quality.

Average CV

The average CV of the entire array of spots gives an excellent metric ofthe entire array quality. Scans from array WoRx alpha systems have beenshown to have approximately ¾ the average CV of a corresponding laserscan.

Normalized Covariance

Covariance is known to be an indicator of the registration amongchannels, as well as the noise. Large covariance is normally a goodsign. Low covariance, however, doesn't always mean the data are bad; itmay mean that the spot is smooth and has only a small amount ofintensity variance. Likewise, high variance is not necessarily bad ifthe variance is caused by a genuine intensity pattern within the spot.FIG. 3 demonstrates that standard deviation increases with increasingspot intensity; the dependence is approximately linear. FIG. 3 alsoshows that the observed standard deviation is not simply caused by thestatistical noise associated with counting discrete events (statisticalnoise). Spots that have a substantial intensity pattern caused bynon-uniform distribution of fluorescence will have a large variance anda large covariance (if the detection system is well aligned and has lownoise).

Thus, to make the covariance and the variance values useful they must benormalized somehow. In general, this can be accomplished by dividing thecovariance by some measure of the spot's intensity variance. Todetermine the spot's variance, one could select one of the channels asthe reference (for example the control channel, which is green), or onecould use a combination of the variance from all channels. The followingtable gives examples of the normalized covariance calculation:Normalized covariance Normalization Method calculations Variances addedin quadrature$\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sqrt{\sigma_{r}^{2} + \sigma_{g}^{2}}}$Equation 19 Variances added$\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\left\lbrack \frac{\left( {\sigma_{r} + \sigma_{g}} \right)}{2} \right\rbrack}$Equation 20 Control channel variance only$\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sigma_{g}}$ Equation 21Experiment channel variance only$\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sigma_{r}}$ Equation 22

Where σ′_(rg) is the normalized covariance, and σ_(r), and σ_(g) are thevariance of channels 1 and 2, respectively.

FIG. 4 illustrates a plot 400 of all the spot's covariance values versustheir average variance (as in equation 20). The plot of FIG. 4 revealsthat the normalized covariance is a very consistent value. The slope ofthe points in FIG. 4 gives the typical value of the normalizedcovariance. (The average of the normalized covariance would give asimilar result.) Outliers on the graph are almost always below thecluster of points along the line. Such outlying points occur when theintensity variance of the spot is unusually high, relative to thecovariance. A study of these points shows that they have some sort ofdefect, which is often a bright speck of contaminating fluorescence.

Covariance/Variance correlation of the Entire Array

Systematically poor correlation between covariance and variance can alsopoint to the scanner's inability to measure covariance due to poorresolution, noise, and/or channel misalignment. Linear regression of thepoints in FIG. 4 gives an indication of the scanner's ability to measurecovariance. A broad scatter plot obviously indicates poor correlation:the variance of the spot intensities is inconsistent between thechannels. A low slope indicates that the scanner has relatively highvariance, relative to it's ability to measure covariance. Thus, onecould compare scanners by comparing the slope and correlationcoefficient of a linear regression of FIG. 4 (when the same slide isscanned). A good scanner has a tight distribution with large slope andoutliers indicate array fabrication quality problems rather thanmeasurement difficulties. The average and standard deviation of thenormalized covariance give similar results and could be used instead ofthe slope and correlation coefficient, respectively.

Spot Intensity Close to Local Background

Spots that are close, or equal, to local background may beindistinguishable from background. A statistical method is employed todetermine whether pixels within the spot are statistically differentthan the background population.

Spot Intensity Below Local Background

Spot intensities below the local background are a good example of howthe local backgrounds are not additive. Such spots are not necessarilybad, but are certainly more difficult to quantify. This is a case whereproper background determination methods are essential. The methoddescribed above can make use of such spots, provided that there isindeed signal above the true calculated background.

Ratio Inconsistency (Alignment Problem or “Dye Separation”)

This metric compares the standard method of measuring the intensityratio with an alternative method. The standard method uses the ratio ofthe average intensities, as described above. The alternative measure ofratio is the average and standard deviation of the pixel-by-pixel ratioof the spot. For reasonable quality spots, these ratios and theirrespective standard deviations are similar.

There are two main source causes of inconsistency. Either the slidepreparation contains artifacts that affect the ratio, or the measurementsystem is unable to adequately measure the spot's intensity. Thefollowing table lists more details about each source of inconsistency.Slide Preparation Measurement Problems Probe separation Noise Targetseparation Misregistration Contamination with fluorescent materialNon-linear response between channels.

Note that all the problems listed in the table will also reduce theamount of covariance. In the case of slide preparation problems, theratio inconsistency points to chemistry problems, whereas measurementproblems point to scanner inadequacy.

Numerous variations and modifications of the invention will becomereadily apparent to those skilled in the art. Accordingly, the inventionmay be embodied in other specific forms without departing from itsspirit or essential characteristics.

1. A method of determining a background intensity of an image; themethod comprising: acquiring image data comprising data representativeof a plurality of microarray probes; for ones of the plurality ofmicroarray probes, extracting an intensity value for each of a pluralityof independent parameters; responsive to the extracting, performing aregression analysis on a curve fit of one independent parameter versusanother; and performing a least-squares analysis of the curve fit todetermine a background intensity of the image data.
 2. The method ofclaim 1 further comprising determining a ratio of an experimental imageto a control image wherein the experimental image is constructedresponsive to the acquiring.
 3. The method of claim 2 further comprisingdetermining the least squares curve fit from the equation:r _(m) =R(g _(m) −g _(b))+r _(b) =Rg _(m) +k where r_(m) and g_(m) aremeasured values of the images, r_(b), and g_(b) are backgroundintensities of the images, and k is a constant.
 4. The method of claim 3further comprising applying a constraint so the background intensitiesare greater than bias levels associated with the images.
 5. The methodof claim 3 further comprising applying a constraint so the backgroundintensities create a zero intercept of a linear regression of theequation:(r _(m) −r _(b))=m(g _(m) −g _(b))+b such that b is approximately zero.6. The method of claim 5 further comprising extracting backgroundsubtraction constants.
 7. A method of selecting a microarray scan foranalysis; the method comprising: acquiring, during a microarray scan,data representative of a plurality of microarray probes; determining acoefficient of variation for the microarray scan; comparing thecoefficient of variation to a predetermined threshold; and selecting amicroarray scan for analysis if the coefficient of variation is lowerthan the predetermined threshold.
 8. The method of claim 7 furthercomprising determining the coefficient of variation from the equation:${CV} = \frac{R_{SD}}{R}$ where$R_{SD} \approx {\sqrt{{g_{SD}^{2}\frac{{\overset{\_}{r}}^{2}}{{\overset{\_}{g}}^{4}}} + \frac{r_{SD}^{2}}{{\overset{\_}{g}}^{2}} - {2\sigma_{rg}\frac{\overset{\_}{r}}{{\overset{\_}{g}}^{3}}}}.}$9. The method of claim 7 further comprising determining a spotcoefficient of variation for each of the plurality of microarray probesresponsive to the acquiring.
 10. The method of claim 7 furthercomprising determining an average coefficient of variation.
 11. A methodof extracting data from an image; the method comprising: determining acovariance and a variance of the image in accordance with datarepresentative of a plurality of imaged microarray probes; normalizingthe covariance; determining an average and a standard deviation of thecovariance; and selecting data to be extracted based on the average andthe standard deviation of the covariance.
 12. The method of claim 11further comprising calculating the covariance according to the followingequation:$\sigma_{rg} = {\frac{1}{n}{\sum\limits_{i = 1}^{n}{\left( {r_{i} - \overset{\_}{r}} \right){\left( {g_{i} - \overset{\_}{g}} \right).}}}}$13. The method of claim 11 further comprising normalizing the covarianceby adding the variances in quadrature according to the followingequation: ${\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sigma_{r}}},$where σ′_(rg) is a normalized covariance and σ_(r) and σ_(g) arevariances of a control channel and an experimental channel.
 14. Themethod of claim 11 further comprising normalizing the covariance byadding the variances according to the following equation:${\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\left\lbrack \frac{\left( {\sigma_{r} + \sigma_{g}} \right)}{2} \right\rbrack}},$where σ′_(rg) is a normalized covariance and σ_(r) and σ_(g) arevariances of a control channel and an experimental channel.
 15. Themethod of claim 11 further comprising normalizing the covariance byusing a control channel variance according to the following equation:${\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sigma_{g}}},$ where σ′_(rg)is a normalized covariance and σ_(r) and σ_(g) are variances of acontrol channel and an experimental channel.
 16. The method of claim 11further comprising normalizing the covariance by using an experimentalchannel variance according to the following equation${\sigma_{rg}^{\prime} = \frac{\sigma_{rg}}{\sigma_{r}}},$ where σ′_(rg)is a normalized covariance and σ_(r) and σ_(g) are variances of acontrol channel and an experimental channel.
 17. A method of extractingdata from an image; the method comprising: determining a covariance anda variance of the image in accordance with data representative of aplurality of imaged microarray probes; determining a slope of thecovariance plotted against the variance; and selecting data to beextracted where the slope exceeds a predetermined threshold.
 18. Themethod of claim 17 further comprising plotting each covariance valueversus an average variance value.
 19. The method of claim 17 furthercomprising ignoring data points not along the slope of the covarianceplotted against the variance.
 20. The method of claim 17 furthercomprising performing linear regression of the covariance plottedagainst the variance to create a distribution of data points.
 21. Themethod of claim 20 further comprising selecting an image having a tightdistribution of data points.